Abstract


We investigate the behavior of neural network learning algorithms with a small,
constant learning rate $\varepsilon$ in stationary, random input environments. It is
rigorously established that the sequence of weight estimates can be approximated
by a certain ordinary differential equation, in the sense of weak convergence
of random processes as $\varepsilon$ tends to zero. As applications, back-propagation in
feedforward architectures and some feature extraction algorithms are studied in
more detail.


1      Introduction 

For  understanding  the  performance  of  neural  network  learning  algorithms,  it  is  of 
fundamental  importance  to  investigate  how  they  behave  in  stationary  random 
input  environments.  This  analysis  yields  information  about  the  asymptotic 
properties  of  the  learned  connection  weights  as  the  number  of  training  samples 
increases  without  bound.  Thus  far,  only  algorithms  with  learning  rates  tending 
to  zero  have  been  studied.  However,  most  neural  network  learning  is  conducted 
using  a  small,  constant  learning  rate.  In  this  paper,  we  investigate  the  limiting 
behavior  of  such  algorithms. 

An  on-line  (local)  learning  algorithm  can  be  written  as 

$\theta_{n+1} = \theta_n + \eta_n Q(z_n, \theta_n)$,  (1)

where $\theta$ is the $k$-dimensional vector of network weights to be learned and its
current estimate at time $n$ is denoted by $\theta_n$, $z_n$ is the training pattern presented
at time $n$, $\eta_n$ is the learning rate employed at time $n$, and $Q(\cdot, \cdot)$ is a suitable
function characteristic of the algorithm.

The key tool in the analysis of the sequence $\{\theta_n\}$ is the so-called interpolated
process $(\theta(t), t \ge 0)$, usually defined by

$\theta(t) = \theta_n, \quad t_n \le t < t_{n+1}$,  (2)

where

$t_0 = 0, \quad t_n = \eta_1 + \cdots + \eta_n$.

(Rather than working with the above piecewise constant interpolation of the
sequence $\{\theta_n\}$, one could also use piecewise linear interpolations. The asymptotic
properties of these two processes are very similar.)

If $\eta_n$ tends to zero at a suitable rate, it can be shown that the interpolated
process of $\{\theta_n\}$ eventually follows a solution trajectory of a corresponding
ODE (ordinary differential equation) with probability one (Ljung, 1977;
Kushner & Clark, 1978). This has allowed researchers to draw very valuable
conclusions about the convergence behavior of $\{\theta_n\}$, see e.g. Oja (1982), Oja &
Karhunen (1985), White (1989), Sanger (1989), Hornik & Kuan (1990), Kuan &
White (1990).

However, if $\eta_n = \varepsilon$, a small constant, the estimates $\{\theta_n\}$ never stabilize,
and the analysis of the interpolated processes cannot be carried out for fixed $\varepsilon$.
Hence, we are interested in asymptotics where $\varepsilon$ becomes smaller and smaller.

In section 2 of this paper, we shall utilize a result by Kushner (1984) to
establish that as $\varepsilon$ tends to zero, the corresponding interpolated processes
converge weakly to the solution of the associated ODE. The result is then applied
to two of the most important network architectures in section 3. The proof of
the main theorem is deferred to the appendix.


2      Convergence  to  the  ODE  limit 

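Consider the algorithm (1) with $\eta_n = \varepsilon$. In this case, we shall write $\{\theta_n^\varepsilon\}$ for
the sequence of estimates to emphasize the dependence on $\varepsilon$, i.e.

$\theta_{n+1}^\varepsilon = \theta_n^\varepsilon + \varepsilon\, Q(z_n, \theta_n^\varepsilon)$.  (3)

The corresponding interpolated process $(\theta^\varepsilon(t), t \ge 0)$ is given by

$\theta^\varepsilon(t) = \theta_n^\varepsilon, \quad n\varepsilon \le t < (n+1)\varepsilon$.  (4)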
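
The recursion (3) and the interpolation (4) can be sketched in a few lines. The following is only an illustrative sketch for scalar weights; the function names `run_constant_rate` and `interpolate` are hypothetical and do not come from the paper, and the update function `Q` is supplied by the caller.

```python
import math

def run_constant_rate(Q, theta0, patterns, eps):
    """Iterate theta_{n+1} = theta_n + eps * Q(z_n, theta_n), as in (3).

    Returns the list [theta_0, theta_1, ..., theta_N] for N patterns.
    Scalar weights are assumed here purely for brevity.
    """
    thetas = [theta0]
    for z in patterns:
        current = thetas[-1]
        thetas.append(current + eps * Q(z, current))
    return thetas

def interpolate(thetas, eps, t):
    """Piecewise constant interpolation (4): theta_n on [n*eps, (n+1)*eps), t >= 0.

    A tiny tolerance guards against floating point error in t / eps;
    times beyond the last computed estimate return the last estimate.
    """
    n = math.floor(t / eps + 1e-12)
    return thetas[min(n, len(thetas) - 1)]

# Toy update (an assumption for illustration): Q(z, theta) = z - theta.
demo = run_constant_rate(lambda z, th: z - th, 0.0, [1.0, 1.0, 1.0], 0.5)
```

With these inputs the estimates are 0, 0.5, 0.75 and 0.875, and `interpolate(demo, 0.5, 0.75)` returns the value held on the interval [0.5, 1.0).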

Our key result is that as $\varepsilon \to 0$, the interpolated processes can be approximated
by the solution of an ODE, in the sense of weak convergence of random processes.
We have the following definition.

Definition. Let $\{\xi^\varepsilon, \varepsilon \ge 0\}$ be a family of random elements with values in
some metric space $(X, \rho)$. We say that $\xi^\varepsilon$ converges weakly to $\xi^0$, symbolically
$\xi^\varepsilon \Rightarrow \xi^0$, if

$\lim_{\varepsilon \to 0} E f(\xi^\varepsilon) = E f(\xi^0)$

for all bounded, continuous real functions $f$ on $X$.

Weak convergence is an extension of the familiar concept of convergence in
distribution of sequences of $\mathbb{R}^k$-valued random variables to families of abstract
valued random elements; as a basic reference, we recommend Billingsley (1968).
If $h$ is a continuous function from $X$ to $\mathbb{R}^m$ and $\xi^\varepsilon \Rightarrow \xi^0$, then $h(\xi^\varepsilon) \to_D h(\xi^0)$,
where "$\to_D$" denotes convergence in distribution (Billingsley, 1968, page 29).
In our case, we shall regard the interpolated processes $\theta^\varepsilon(\cdot)$ as random
processes with values in $X = D^k[0, T_\infty)$, the space of all functions from $[0, T_\infty)$
to $\mathbb{R}^k$ which are right continuous with left-hand limits at every $0 < t < T_\infty$.
Here, $T_\infty$ is the supremum over all $T$ such that the limiting ODE has a unique
solution on $[0, T]$ with probability one; in particular, if it has a unique global
solution on $[0, \infty)$ for every initial condition, then $T_\infty = \infty$. As a metric on $X$
we shall use

$\rho(\xi, \zeta) = \sum_{m=1}^{\infty} 2^{-m} \min\{1, \sup_{0 \le t \le T_m} |\xi(t) - \zeta(t)|\}, \quad \xi, \zeta \in X$,

where $\{T_m\}$ is an increasing sequence with $\lim_{m \to \infty} T_m = T_\infty$, such that $X$ is
given the topology of uniform convergence on bounded subintervals of $[0, T_\infty)$.
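
To make the metric concrete, the sketch below approximates it numerically for two scalar paths. It is an illustration only: it assumes $T_\infty = \infty$ with the hypothetical choice $T_m = m$, truncates the series after `num_terms` terms, and replaces each supremum by a maximum over a grid; the name `rho` and its parameters are not from the paper.

```python
def rho(xi, zeta, num_terms=20, grid_step=0.01):
    """Approximate sum_{m>=1} 2^-m * min(1, sup_{0<=t<=T_m} |xi(t) - zeta(t)|).

    Assumptions of this sketch: scalar paths given as callables,
    T_m = m, series truncated at num_terms, sup taken over a grid.
    """
    steps_per_unit = round(1.0 / grid_step)
    total = 0.0
    running_sup = 0.0
    k = 0  # index of the next grid point to visit
    for m in range(1, num_terms + 1):
        # Extend the running supremum from [0, T_{m-1}] to [0, T_m].
        while k <= m * steps_per_unit:
            t = k * grid_step
            running_sup = max(running_sup, abs(xi(t) - zeta(t)))
            k += 1
        total += 2.0 ** (-m) * min(1.0, running_sup)
    return total

# Two constant paths a distance 0.25 apart give a value close to 0.25.
example = rho(lambda t: 0.0, lambda t: 0.25)
```

Because every term is capped at one, paths that differ only far out in time contribute little, which is what gives the topology of uniform convergence on bounded subintervals.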

In order to obtain the ODE limit of $\theta^\varepsilon(\cdot)$, we must be able to average out
the randomness resulting from the pattern sequence $\{z_n\}$, so that the limiting
process (as $\varepsilon$ tends to zero) eventually follows the "mean" behavior. This is the
basic idea of the so-called direct averaging method of Kushner (1984, chapter 5),
cf. also Kushner & Shwartz (1984).


We assume the following conditions.

[A 1] $\{z_n\}$ is a (strictly) stationary and ergodic sequence of random vectors.

[A 2] For each $N$, there exists a function $L_N(z)$ such that $L_N(z_n)$ is integrable
and

$|Q(z, \theta) - Q(z, \vartheta)| \le L_N(z)\, |\theta - \vartheta|$ for all $|\theta|, |\vartheta| \le N$.  (5)

[A 3] For each $\theta$, $Q(z_n, \theta)$ is square integrable with expectation

$\bar{Q}(\theta) := E\, Q(z_n, \theta)$.

These conditions are not the weakest possible, but they can easily be verified
or interpreted. The stationarity and ergodicity assumption [A 1] applies when
we have time series data; in particular, it is satisfied when the training patterns
are identically distributed, independent random variables. [A 2] is a
Lipschitz-type smoothness condition on the function $Q$, which is satisfied in many
interesting neural network applications, as shown in later examples. [A 3] defines
the corresponding ODE. Of course, conditions [A 2] and [A 3] are met if the
patterns are bounded and $Q$ is continuously differentiable in both arguments.
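
As a minimal worked illustration (a toy case chosen here, not one treated in the paper), consider the scalar averaging rule $Q(z, \theta) = z - \theta$ with square integrable patterns. The symbol $\mu$ for the pattern mean is introduced only for this example.

```latex
% Hypothetical toy example: Q(z, \theta) = z - \theta, with \mu := E z_n.
\begin{align*}
|Q(z,\theta) - Q(z,\vartheta)| &= |\theta - \vartheta|
  && \text{so [A 2] holds with } L_N(z) \equiv 1, \\
\bar{Q}(\theta) &= E\,(z_n - \theta) = \mu - \theta
  && \text{so [A 3] holds if } E\,z_n^2 < \infty, \\
\dot{\theta} &= \mu - \theta, \quad \theta(0) = \theta_0
  && \Longrightarrow \quad \theta(t) = \mu + (\theta_0 - \mu)\,e^{-t}.
\end{align*}
```

In this case the associated ODE has the single, globally attractive equilibrium $\theta = \mu$.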

The  following  theorem  is  based  on  theorem  5.1  in  Kushner  (1984);  its  proof 
is  given  in  the  appendix. 

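Theorem. Assume [A 1] to [A 3] and let $\theta_0^\varepsilon = \theta_0$, a fixed vector or a random
vector independent of $\varepsilon$. Then

$\theta^\varepsilon(\cdot) \Rightarrow \theta(\cdot)$,  (6)

where $\theta(\cdot)$ is the solution of the ODE

$\dot{\theta} = \bar{Q}(\theta), \quad \theta(0) = \theta_0$.

In particular, if $0 \le t_1 < \cdots < t_l < T_\infty$,

$(\theta^\varepsilon(t_1), \ldots, \theta^\varepsilon(t_l)) \to_D (\theta(t_1), \ldots, \theta(t_l))$.  (7)

If in addition $\theta_0$ is nonrandom, then for each $T < T_\infty$,

$\sup_{0 \le t \le T} |\theta^\varepsilon(t) - \theta(t)| \to 0$ in probability.  (8)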
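
The statement (8) can be observed numerically on a toy problem. The sketch below is an illustration under stated assumptions, not part of the paper: it uses the scalar rule $Q(z, \theta) = z - \theta$ with independent patterns uniform on $[0, 2]$ (mean 1) and $\theta_0 = 0$, for which the ODE solution is $\theta(t) = 1 - e^{-t}$. The name `sup_error` is hypothetical.

```python
import math
import random

def sup_error(eps, T=5.0, seed=0):
    """sup over [0, T] of |theta^eps(t) - theta(t)| for the toy rule Q = z - theta.

    theta^eps is constant on [n*eps, (n+1)*eps) and the ODE solution
    1 - exp(-t) is monotone, so the supremum over each interval equals
    the larger of the two deviations at the interval endpoints.
    """
    rng = random.Random(seed)
    theta = 0.0
    worst = 0.0
    n_steps = int(round(T / eps))
    for n in range(n_steps):
        for t in (n * eps, (n + 1) * eps):
            worst = max(worst, abs(theta - (1.0 - math.exp(-t))))
        z = rng.uniform(0.0, 2.0)      # assumed i.i.d. pattern with mean 1
        theta += eps * (z - theta)     # constant learning rate update (3)
    return worst

err_coarse = sup_error(0.1)
err_fine = sup_error(1e-4)
```

In this setup one would expect `err_fine` to be markedly smaller than `err_coarse`, in line with (8); the precise values depend on the random seed.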


Assuming $\theta_0$ to be nonrandom does not really impose a restriction; in fact, in
virtually all neural network applications, the initial $\theta_0$ is, although probably
chosen at "random" with the aid of some random number generator, independent
of the learning patterns, and we can analyze the behavior of $\{\theta_n^\varepsilon\}$ conditional
on the initial weights.


The above theorem establishes a very close relation between the sequence of
estimates generated by a neural network learning algorithm with small constant
learning rates and the solution paths of its associated ODE. Let $\Theta_d$ be the set
of all "desired" limit points of the algorithm; usually, $\Theta_d$ consists of all $\theta$ which
minimize some criterion function, e.g. the mean square error when approximating
targets by actual network outputs in supervised learning. Clearly, we
expect the algorithm to perform well if the domain of attraction of $\Theta_d$ contains
all (or most) reasonable initial configurations. In particular, $\Theta_d$ should contain
at least one asymptotically stable equilibrium of the ODE. Of course, optimal
performance is achieved if there is a single, globally attractive equilibrium;
however, we are unaware of any neural network learning algorithm which has this
property. Taking Error Back-Propagation (EBP) as the most prominent example,
it is well known that the error functions for most architectures contain a
multiplicity of local minima which are equilibria of the associated ODE.

A global asymptotic analysis of the solution paths of the ODE and in particular,
an explicit characterization of the domains of attraction of the equilibria,
usually cannot be carried out. Nevertheless, reasonable performance can be
expected in the cases where the set of all asymptotically stable equilibria is
contained in $\Theta_d$; some of the feature extraction algorithms investigated in the next
section have this property. However, it has to be pointed out that it may already
be impossible to give a closed analytic description of the set of all equilibrium
points (cf. EBP).

3      Applications 

3.1      Error  Back-Propagation 

Consider the following learning problem. We are given a feedforward network
architecture with output $o = F(x, \theta)$; here, $x$ is the network input, $\theta$ is the
vector of all adjustable weights and $F$ is a function characteristic of the network
topology. Suppose that we are also given a sequence of identically distributed
training patterns $z_n = (x_n, y_n)$ and that it is desired to adjust the network
weights in a way that the mean square approximation error

$\Phi(\theta) = \frac{1}{2} E\, |y_n - F(x_n, \theta)|^2$

is minimized. The most prominent (on-line) algorithm which has been proposed
in the neural network literature is the error back-propagation (EBP) algorithm
(Rumelhart, Hinton & Williams, 1986), which is a simple gradient descent
algorithm allowing for efficient updating of the weights due to the feedforward
architecture. In this case,

$Q(x, y, \theta) = \nabla F(x, \theta)\,(y - F(x, \theta)), \qquad \bar{Q}(\theta) = E\, Q(x_n, y_n, \theta) = -\nabla \Phi(\theta)$.


(The "$\nabla$" symbol denotes taking all partial derivatives with respect to the
components of $\theta$.) Hence, if the conditions of our theorem are satisfied, we have
$\theta^\varepsilon(\cdot) \Rightarrow \theta(\cdot)$, where $\theta(\cdot)$ solves the ODE

$\dot{\theta} = -\nabla \Phi(\theta)$.

A simple application of the chain rule yields that

$\frac{d}{dt} \Phi(\theta(t)) = \nabla \Phi(\theta(t))'\, \dot{\theta}(t) = -|\nabla \Phi(\theta(t))|^2 \le 0$;

thus $\Phi$ is strictly reduced along the solution paths of the ODE unless an
equilibrium is reached. If we knew in addition that the level sets $\Theta_N = \{\theta : \Phi(\theta) \le N\}$
are bounded subsets of $\mathbb{R}^k$, we could conclude that the set of equilibria is globally
attractive; however, this condition is not satisfied in many important
applications, e.g. in multilayer feedforward architectures with bounded hidden layer
activation functions, or in linear architectures with a bottleneck layer as
described in Baldi & Hornik (1989). Therefore, although the local minima of $\Phi$
are of course the asymptotically stable equilibria of the ODE, it is not
necessarily true that the solution paths of the ODE converge to a local minimum of
$\Phi$ for "most" (e.g. in the sense that the exceptional set has Lebesgue measure
zero) initial values, as is very often claimed.

If we assume that the training patterns $z_n$ have bounded fourth moments,
conditions [A 2] and [A 3] can easily be verified for the usual multilayer
multioutput feedforward architectures with logistic or arctangent hidden layer activation
functions. As a (notationally convenient) illustration, consider the following
single hidden layer network where the $i$-th output component $o_i$ is given by

$o_i = f_i(x, \theta) = g_i\left(\sum_{h=1}^{q} \beta_{ih}\, \psi_h\left(\sum_{j=1}^{d} \alpha_{hj} x_j\right)\right), \quad i = 1, \ldots, p$;

here, $d$, $q$ and $p$ are the numbers of input, hidden and output units, respectively,
$x_j$ is the $j$-th input component, and

$\theta = (\alpha_{11}, \ldots, \alpha_{qd}, \beta_{11}, \ldots, \beta_{pq})$.
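
The network above and the EBP update direction $Q(x, y, \theta) = \nabla F(x, \theta)(y - F(x, \theta))$ can be sketched as follows. This is a minimal sketch under assumptions made here for concreteness: all hidden activations $\psi_h$ are taken to be tanh, all output activations $g_i$ are the identity, and the weights are stored as nested lists `alpha` (q x d) and `beta` (p x q). The function names are hypothetical.

```python
import math

def forward(x, alpha, beta):
    """Hidden activations and outputs o_i = sum_h beta[i][h] * tanh(sum_j alpha[h][j] * x[j])."""
    hidden = [math.tanh(sum(a * xj for a, xj in zip(row, x))) for row in alpha]
    out = [sum(b * h for b, h in zip(row, hidden)) for row in beta]
    return hidden, out

def ebp_direction(x, y, alpha, beta):
    """Q(x, y, theta) = grad F(x, theta) (y - F(x, theta)), split into alpha and beta parts."""
    hidden, out = forward(x, alpha, beta)
    err = [yi - oi for yi, oi in zip(y, out)]
    # d o_i / d beta_ih = psi_h(rho_h) for identity output units.
    q_beta = [[e * h for h in hidden] for e in err]
    q_alpha = []
    for h_idx in range(len(alpha)):
        # Error propagated back to hidden unit h_idx through the beta weights.
        back = sum(err[i] * beta[i][h_idx] for i in range(len(beta)))
        dpsi = 1.0 - hidden[h_idx] ** 2        # derivative of tanh
        q_alpha.append([back * dpsi * xj for xj in x])
    return q_alpha, q_beta

def ebp_step(x, y, alpha, beta, eps):
    """One on-line update theta <- theta + eps * Q(x, y, theta)."""
    q_alpha, q_beta = ebp_direction(x, y, alpha, beta)
    new_alpha = [[a + eps * q for a, q in zip(ra, rq)] for ra, rq in zip(alpha, q_alpha)]
    new_beta = [[b + eps * q for b, q in zip(rb, rq)] for rb, rq in zip(beta, q_beta)]
    return new_alpha, new_beta
```

For a single pattern, `ebp_direction` is the negative gradient of the squared error $\frac{1}{2}|y - F(x, \theta)|^2$, so a sufficiently small step should not increase that error.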

We  then  have  the  following  result. 

Corollary 1. Suppose that all activation functions $\psi_h$ and $g_i$ are twice
continuously differentiable and that in addition, all hidden layer activation functions
$\psi_h$ are bounded and have bounded derivatives up to order two. Then, if the
sequence of training patterns $\{z_n\}$ is stationary and ergodic with finite fourth
moments,

$\theta^\varepsilon(\cdot) \Rightarrow \theta(\cdot)$,

where $\theta(\cdot)$ solves the ODE

$\dot{\theta} = -\nabla \Phi(\theta), \quad \theta(0) = \theta_0$.


For the proof, let us start by observing that under the above conditions, the
network output $F(x, \theta)$ is uniformly bounded in $x$ over the weight sets where
$|\theta| \le N$. The nonzero entries in $\nabla F(x, \theta)$ are of the form

$\frac{\partial o_i}{\partial \beta_{ih}} = g_i'(\sigma_i)\, \psi_h(\rho_h), \qquad \frac{\partial o_i}{\partial \alpha_{hj}} = g_i'(\sigma_i)\, \beta_{ih}\, \psi_h'(\rho_h)\, x_j$,

where $\rho_h = \sum_j \alpha_{hj} x_j$ and $\sigma_i = \sum_h \beta_{ih} \psi_h(\rho_h)$. Hence, we can find a finite
constant $C_N$ such that

$|\nabla F(x, \theta)| \le C_N |x|, \qquad |\theta| \le N$.

Similarly, it can be shown that there is a finite constant $D_N$ such that all second
order partials of $F$ with respect to components of $\theta$ can be bounded by $D_N |x|$,
uniformly in $x$ over $\{|\theta| \le N\}$.

We conclude that if in addition inputs and outputs have finite fourth
moments, then $Q(x_n, y_n, \theta)$ is square integrable by Schwarz's inequality and [A 3]
is satisfied. If we let $L_N(x, y) := \sup_{|\theta| \le N} |\nabla Q(x, y, \theta)|$, then (5) is satisfied and,
again using the above estimates, we see that $L_N(x_n, y_n)$ is integrable, whence
[A 2].

3.2      Feature  Extraction  Algorithms  for  Linear  Networks 

For many applications it is very important to train networks to be able to extract
the main features inherent in high-dimensional input data streams, thereby
significantly reducing data dimensionality. Generally speaking, we are looking for
functions $F$ which compress a $d$-dimensional input vector $x$ into a $p$-dimensional
output vector $y = F(x)$ (where $p < d$ and usually $p \ll d$) such that, in a sense
to be made more precise, $y$ contains "as much information about $x$ as possible".

If we use the mean square error of the best linear estimate of $x$ given $y$ (the
"linear reconstruction error") as a criterion, this leads to a statistical technique
known as Principal Component Analysis (PCA), see Bourlard & Kamp (1988),
Linsker (1988), Sanger (1989), Baldi & Hornik (1991). In this case, the outputs
are of the form $y = Wx$, and the set of all optimal $p \times d$ matrices $W$ can be
described as follows. Let $\lambda_1 > \cdots > \lambda_d$ be the eigenvalues of the input
covariance matrix $\Sigma = E\, x_n x_n'$ (in what follows, $'$ denotes transpose), and assume for
simplicity that the inputs are centered, i.e. $E\, x_n = 0$, and that all eigenvalues
are distinct and positive. For $j = 1, \ldots, d$, let $u_j$ be a unit length eigenvector of
$\Sigma$ corresponding to the eigenvalue $\lambda_j$. Then $W$ is optimal iff its rows span the
same $p$-dimensional subspace of $\mathbb{R}^d$ as $u_1, \ldots, u_p$, i.e. iff $W = R U_p$, where $R$ is
an invertible $p \times p$ matrix and $U_p = [u_1, \ldots, u_p]'$.

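One class of PCA learning algorithms which have been proposed in the
literature can be described as follows, see e.g. Baldi & Hornik (1991), Hornik &
Kuan (1990). $W$ is decomposed as $W = MA$, where $M$ is a $p \times p$ matrix with
all diagonal entries equal to one. In particular, we could have $M = I$, the $p \times p$
unit matrix. The algorithm is

$A_{n+1}^\varepsilon = A_n^\varepsilon + \varepsilon\, Q_A(x_n, A_n^\varepsilon, M_n^\varepsilon)$,
$M_{n+1}^\varepsilon = M_n^\varepsilon + \varepsilon\, Q_M(x_n, A_n^\varepsilon, M_n^\varepsilon)$,

with $y = Wx = MAx$ and

$Q_A(x, A, M) = y x' - \mathcal{Q}_A(y y')\, A$,
$Q_M(x, A, M) = \mathcal{Q}_M(y y')$;

both $\mathcal{Q}_A$ and $\mathcal{Q}_M$ are linear operators on the space of $p \times p$ matrices.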
The  following  result  follows  immediately  from  our  main  theorem. 

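Corollary 2. Suppose that the starting values $A_0$ and $M_0$ are independent from
$\varepsilon$ and that the input sequence $\{x_n\}$ is stationary and ergodic with finite fourth
moments. Then

$(A^\varepsilon(\cdot), M^\varepsilon(\cdot)) \Rightarrow (A(\cdot), M(\cdot))$,

where $(A(\cdot), M(\cdot))$ is the solution of the ODE

$\dot{A} = M A \Sigma - \mathcal{Q}_A(M A \Sigma A' M')\, A, \quad A(0) = A_0$,

$\dot{M} = \mathcal{Q}_M(M A \Sigma A' M'), \quad M(0) = M_0$.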

If we take $\mathcal{Q}_A$ as the identity mapping and $M = I$, we obtain an algorithm
introduced independently by Williams (1985) as the SEC (symmetric error
correction) algorithm, by Baldi (1988) as a symmetric simplification of the BP
algorithm for a linear $d$-$p$-$d$ architecture in autoassociative mode, and by Oja (1989)
as the subspace algorithm; for more details, see Baldi & Hornik (1991). Of
course, this algorithm is a generalization of the one-unit algorithm introduced
in Oja (1982) as a first order approximation to normalized hebbian learning
with small learning rates. In this case, the limiting ODE is

$\dot{A} = A \Sigma - A \Sigma A' A$.

For the one-unit case (i.e. $p = 1$), the asymptotic behavior of the solutions
of this ODE is completely analyzed in Oja & Karhunen (1985). It can be shown
that the solution paths always converge to $\pm u_1$ unless the starting value is
perpendicular to $u_1$.
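
The one-unit behavior can be checked numerically by integrating the ODE $\dot{A} = A\Sigma - A\Sigma A'A$ for $p = 1$ with a simple Euler scheme. The sketch below is illustrative only: the covariance matrix, step size, horizon and function names are assumptions chosen here. The matrix used has eigenvalues 3 and 1, with $u_1 = (1, 1)/\sqrt{2}$.

```python
def oja_ode_step(a, sigma, dt):
    """One Euler step of da/dt = a Sigma - (a Sigma a') a for a single row vector a."""
    d = len(a)
    a_sigma = [sum(a[i] * sigma[i][j] for i in range(d)) for j in range(d)]
    quad = sum(a_sigma[j] * a[j] for j in range(d))   # scalar a Sigma a'
    return [a[j] + dt * (a_sigma[j] - quad * a[j]) for j in range(d)]

def integrate_oja(a0, sigma, dt=0.01, steps=2000):
    """Euler integration of the one-unit ODE from a0 over time dt * steps."""
    a = list(a0)
    for _ in range(steps):
        a = oja_ode_step(a, sigma, dt)
    return a

# Assumed example covariance with eigenvalues 3 and 1.
sigma_example = [[2.0, 1.0], [1.0, 2.0]]
a_final = integrate_oja([1.0, 0.0], sigma_example)
```

Starting from $(1, 0)$, which is not perpendicular to $u_1$, the path ends close to $u_1 \approx (0.707, 0.707)$. A start proportional to $(1, -1)$ is perpendicular to $u_1$ and stays so along the whole path.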

For $p > 1$, similar global results do not seem to be available. It can be
shown that all full rank equilibrium points of the ODE are of the form
$A = R[u_{i_1}, \ldots, u_{i_p}]'$, where $1 \le i_1 < \cdots < i_p \le d$ and $R$ is an orthogonal $p \times p$
matrix (see e.g. Baldi & Hornik, 1991). Therefore, as these equilibrium points
are not isolated, they cannot be asymptotically stable. More precisely, Krogh &
Hertz (1990) show that all equilibria with $\{i_1, \ldots, i_p\} \ne \{1, \ldots, p\}$ are unstable
and that for equilibria of the form $A = R U_p$, only the components of small
perturbations about the equilibrium $A$ which are perpendicular to the row space
of $A$ die out asymptotically. Thus, one might expect that the estimates more
or less "randomly" walk around the manifold

$\mathcal{A} = \{A = R U_p : R \text{ orthogonal}\}$

rather than being attracted by one particular equilibrium.

These stability problems disappear if instead we use the asymmetric
algorithm introduced in Sanger (1989) as the GHA (generalized hebbian algorithm).
In this case, we take $M = I$ and $\mathcal{Q}_A$ as the "lower" operator which sets all
entries of a $p \times p$ matrix which are above the main diagonal to zero. As shown
in Sanger (1989), see also Hornik & Kuan (1990), the asymptotically stable
equilibria of the associated ODE

$\dot{A} = A \Sigma - \mathrm{lower}(A \Sigma A')\, A$

are given by

$A = [\pm u_1, \ldots, \pm u_p]'$.
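
The GHA limit ODE $\dot{A} = A\Sigma - \mathrm{lower}(A\Sigma A')A$ can likewise be explored with a plain Euler scheme. The following is a sketch under assumptions made here: a diagonal covariance with eigenvalues 3, 2, 1 (so that $u_j$ is the $j$-th unit vector), $p = 2$, and an arbitrary generic starting matrix. All function names are hypothetical.

```python
def matmul(X, Y):
    """Product of two matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

def lower(S):
    """Set all entries above the main diagonal of a square matrix to zero."""
    return [[S[i][j] if j <= i else 0.0 for j in range(len(S))]
            for i in range(len(S))]

def gha_ode_step(A, sigma, dt):
    """One Euler step of dA/dt = A Sigma - lower(A Sigma A') A."""
    A_sigma = matmul(A, sigma)                     # p x d
    S = matmul(A_sigma, transpose(A))              # p x p matrix A Sigma A'
    corr = matmul(lower(S), A)                     # p x d
    return [[A[i][j] + dt * (A_sigma[i][j] - corr[i][j])
             for j in range(len(A[0]))] for i in range(len(A))]

def integrate_gha(A0, sigma, dt=0.01, steps=4000):
    """Euler integration of the GHA ODE from A0 over time dt * steps."""
    A = [row[:] for row in A0]
    for _ in range(steps):
        A = gha_ode_step(A, sigma, dt)
    return A

# Assumed example: eigenvalues 3 > 2 > 1 with eigenvectors e_1, e_2, e_3.
sigma_diag = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
A_final = integrate_gha([[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]], sigma_diag)
```

For this starting matrix the rows of `A_final` should end up close to $\pm u_1$ and $\pm u_2$ respectively, in agreement with the form of the stable equilibria. The first row evolves exactly as in the one-unit case, since `lower` removes its coupling to the second row.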

Therefore, the performance of the GHA should be as good as the one of
Oja's one-unit algorithm (which is of course the GHA for $p = 1$), and it should
be superior to the symmetric algorithm. A satisfactory global analysis of the
asymptotic behavior of the solution paths of the above ODE has not been carried
out thus far. Sanger (1989, p. 463) claims that the domain of attraction of the
set of asymptotically stable equilibria consists of all matrices $A$, which is not
true, due to the existence of equilibria which are not asymptotically stable. In
fact, it is easily seen that if the rows of the initial $A(0)$ are perpendicular to
some $u_j$, then the same is true for all $A(t)$, $t > 0$.

Rubner & Tavan (1990) introduced an algorithm where, upon presentation
of a new pattern $x$, $A$ is updated according to a hebbian learning rule with
columnwise normalization, and $M$ is modified using an asymmetric (i.e.
hierarchical) decorrelation filter. If instead we use Oja's one-unit algorithm for each
of the rows of $A$ (Baldi & Hornik, 1991), we obtain another algorithm contained
in our general class, with the choices $\mathcal{Q}_A = \mathrm{diag}$ and $\mathcal{Q}_M = -\mathrm{subdiag}$ ("diag"
respectively "subdiag" are the linear operators which set the offdiagonal
respectively the superdiagonal entries of a square matrix to zero). The corresponding
ODE is

$\dot{A} = M A \Sigma - \mathrm{diag}(M A \Sigma A' M')\, A$,
$\dot{M} = -\mathrm{subdiag}(M A \Sigma A' M')$,

with the appropriate initial conditions; usually, $A_0$ is "random" and $M_0 = I$.
Hornik & Kuan (1990) show that the asymptotically stable equilibria of this
ODE are given by

$A = [\pm u_1, \ldots, \pm u_p]', \quad M = I$.


The  global  asymptotic  behavior  of  the  solution  paths  has  not  been  described 
thus  far. 

Of course, many other PCA learning algorithms exist. Földiák (1989)
suggested an algorithm which combines Oja's one-unit algorithm and lateral
inhibition terms. As this algorithm uses a feedback rather than the above feedforward
architecture, it cannot be dealt with in our framework, because the learning
patterns $z_n$ then consist of the new input $x_n$ and some feedback term $y_n$ which
depends on all previous inputs and weight estimates (the case of "state
dependent noise"). Hornik & Kuan (1990) analyze the asymptotic behavior of such
feedback feature extraction algorithms for the case where the learning rates tend
to zero at a suitable rate; the case of constant learning rates is currently being
investigated.


Appendix  -  Proof  of  the  theorem 

We proceed along the lines of theorem 5.1 in Kushner (1984). (Our notation is
different from Kushner's; the correspondences are $\theta \leftrightarrow x$, $z \leftrightarrow \xi$, and $Q \leftrightarrow G$;
also observe that in our case, both $Q$ and $\{z_j\}$ do not depend upon $\varepsilon$.)
Due to stationarity, establishing uniform integrability of

$\{\sup_{|\theta| \le N} |Q(z_j, \theta)|, \; j \ge 0\}$

reduces to showing that $E \sup_{|\theta| \le N} |Q(z_j, \theta)| < \infty$ (cf. Billingsley (1968, p. 32)),
which follows from

$\sup_{|\theta| \le N} |Q(z_j, \theta)| \le \sup_{|\theta| \le N} |Q(z_j, \theta) - Q(z_j, 0)| + |Q(z_j, 0)|$

$\le \sup_{|\theta| \le N} L_N(z_j)\, |\theta| + |Q(z_j, 0)|$

$\le N L_N(z_j) + |Q(z_j, 0)|$,

integrability of $L_N(z_j)$ and square integrability of $Q(z_j, 0)$, thereby establishing
(A5.2.1).

For $|\theta|, |\vartheta| \le N$ we have

$E \sup_{|\theta - \vartheta| \le \delta} |Q(z_j, \theta) - Q(z_j, \vartheta)| \le \delta\, E L_N(z_j)$,

hence the left hand side tends to zero as $\delta \to 0$, establishing (A5.2.2.a). Finally,
stationarity and ergodicity of $\{z_n\}$ ensure that

$\frac{1}{n - m} \sum_{j=m}^{n-1} Q(z_j, \theta) \to E\, Q(z_j, \theta) = \bar{Q}(\theta)$

in mean square as $n - m \to \infty$ (Doob, 1953, theorem X.6.1), which in turn
implies (A5.2.3.a).
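
The averaging step can be illustrated numerically. The sketch below assumes, purely for illustration, independent patterns uniform on $[0, 2]$ and the toy rule $Q(z, \theta) = z - \theta$, so that $\bar{Q}(\theta) = 1 - \theta$; the function name `averaged_Q` is hypothetical.

```python
import random

def averaged_Q(theta, n, seed=0):
    """Sample average (1/n) * sum_j Q(z_j, theta) for the toy rule Q(z, theta) = z - theta.

    The patterns z_j are drawn i.i.d. uniform on [0, 2], an assumption of this
    sketch; i.i.d. sequences are a special case of [A 1].
    """
    rng = random.Random(seed)
    return sum(rng.uniform(0.0, 2.0) - theta for _ in range(n)) / n

# For large n the average should be close to Q_bar(0.3) = 1 - 0.3 = 0.7.
approx = averaged_Q(0.3, 20000)
```

With fixed $\theta$ the sample average approaches $\bar{Q}(\theta)$ as the number of averaged patterns grows, which is the property used to verify (A5.2.3.a).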

Theorem 5.1 in Kushner (1984) now yields that if $X$ is given the Skorohod
topology (see Kushner, 1984, pages 30-33), then $\theta^\varepsilon(\cdot) \Rightarrow \theta(\cdot)$. In fact, Kushner
assumes that $T_\infty = \infty$; however, it is straightforward to see that everything goes
through mutatis mutandis if $T_\infty < \infty$. Proceeding along the lines of chapter 18
in Billingsley (1968), it can be shown that we also have $\theta^\varepsilon(\cdot) \Rightarrow \theta(\cdot)$ if $X$ is given
the topology of uniform convergence on bounded subintervals.

The remaining assertions can now easily be established. The mappings
$x \in X \mapsto (x(t_1), \ldots, x(t_l))$ and $x \in X \mapsto \sup_{0 \le t \le T} |x(t) - y(t)|$ for fixed (and
nonrandom) $y \in X$ are continuous mappings from $X$ to $\mathbb{R}^{kl}$ respectively $\mathbb{R}$.
Using the Continuous Mapping Theorem (Billingsley, 1968, theorem 5.1), (7)
follows immediately, and (8) follows together with the fact that weak convergence
to a nonrandom limit is equivalent to convergence in probability to that limit, see
e.g. Billingsley (1968, p. 25).